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Abstract 

The paper introduces a methodology for visualizing on a dimension reduced subspace the 
classification structure and the geometric characteristics induced by an estimated Gaussian 
mixture model for discriminant analysis. In particular, we consider the case of mixture of 
mixture models with varying parametrization which allow for parsimonious models. The 
approach is an extension of an existing work on reducing dimensionality for model-based 
clustering based on Gaussian mixtures. Information on the dimension reduction subspace is 
provided by the variation on class locations and, depending on the estimated mixture model, 
on the variation on class dispersions. Projections along the estimated directions provide 
summary plots which help to visualize the structure of the classes and their characteristics. 
A suitable modification of the method allows us to recover the most discriminant directions, 
i.e., those that show maximal separation among classes. The approach is illustrated using 
simulated and real data. 

Keywords: Dimension reduction, Model-based discriminant analysis, Gaussian mixtures, 
Canonical variates for mixture modeling 


1 Introduction 


Discriminant analysis or supervised learning indicates a broad set of statistical methods aimed 
at classifying a categorical outcome variable Y, an indicator with K classes, on the basis of a 
(px 1) vector of features x. Among the several methods available for continuous features, one 
of the most popular approach is classical linear discriminant analysis (LDA). This method has 
been extended to quadratic discriminant analysis (QDA), and, more generally, to models based 
on finite mixture modeling of Gaussian densities. 

Independently from the statistical method adopted, visualization and graphics can play an 
important role in the understanding of the classification results. Typically, canonical variates 
are computed when the dimension of the features space is large. This allows us to visualize the 
classes on a reduced subspace, often bi-dimensional. However, canonical variates are tailored to 
LDA. Some methods have been proposed for QDA, while graphical methods for finite mixture 


modeling is still a research area to be explored. From a different point of view, Hennig (2004) has 


proposed an asymmetric discriminant projection method by looking at the projections where a 
class appears as homogeneous as possible and separated from the remaining groups. 

In this paper a dimension reduction method for visualizing and summarizing the fit of a 
model-based mixture discriminant analysis is discussed. The approach is an extension of the 


method proposed by Scrucca (2010) for model-based clustering. The estimated subspace is found 


by looking at the variation in class means and class covariances depending upon the assumed 
parameterization of the fitted Gaussian mixture model. The resulting projection subspace is able 
to capture most of the classification structure available in the data. The proposal reduces to LDA 
canonical variates for a specific parameterization of the mixture model, while it is equivalent to 
a recently proposed method for QDA. In all the other cases, the proposed visualization method 
is able to show the main geometric characteristics of the fitted mixture model. Furthermore, 
the proposal can be adapted to allow for the visualization of the separation among the classes. 
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The paper is organized as follows. In Section 2 a brief review of classification and graphical 
methods based on the Gaussian distribution is provided. Section 3 describes the Gaussian 
mixture models for discriminant analysis we consider in this paper. In Section 4 the methodology 
is introduced and the main properties are described. Section 5 presents examples based on 
both simulated and real data. In Section 6 the proposed method is extended to allow to 
recover the most discriminant directions, i.e., those that show maximal separation among classes. 
Concluding remarks appear in the final Section. 


2 Classification based on the Gaussian distribution and existing 
graphical methods 

All the models discussed in this paper are probabilistic, i.e., based on the assumption that 
the observations in the kth class (k = 1 are generated by a probability distribution 

fk(x). where x = (x\,X 2 , • • ■, £ P ) T is a column vector of p observed features. Most discriminant 
analysis methods for continuous variables are based on the assumption that observations in each 
class are multivariate normal, so that fk{ x ) = 4 > { x \k i ki E*.), where <j) is the p-variate Gaussian 
density with mean p, k and covariance matrix 5]*,. 

Linear discriminant analysis (LDA) assumes normal populations with equal class covariance 
matrices, i.e., = Spy for all k = where Spy = Yl^k^k^k is the within-class 

covariance matrix with class prior probabilities Tik- The resulting discriminant function is linear 
in the feature vector x and the acceptance regions for the classes are separated in by 


means of hyperplanes. However, Fisher’s (1936) original proposal did not rely on the Gaussian 


distribution. Based on geometric arguments, he looked for a vector of d linear combinations 
/3 T x, with ft E M pxrf , such that the between-class covariance, ~ rXaR — /-0 T 

with fi = 7Tfc/ifc, is maximized relative to the within-class covariance, This amounts 

to maximize the so called Rayleigh quotient , i.e., 

P T v B ft 

argmax .. —— 

p ft T ^wft 

or, equivalently, find ft E W xd which maximizes ft T S#/3 subject to /3 T Epy/3 = Id, where Id is 
the identity matrix of dimension (d x d). The problem is solved by the generalized eigenvalue 
decomposition of with respect to Epy. 

The directions given by the d columns of ft form the basis of the d-dimensional reduction 
subspace, S{( 3), which shows the maximal separation among classes, and decision boundaries are 
linear in the projected features subspace. The dimension of this subspace is d = min(p, K — 1), 
so just one direction can be estimated in two-class problems. Fisher’s or LDA canonical variates, 
ft T x, express the projection onto this subspace, and provide a graphical counterpart to LDA 


(Mardia et al 1979, Chap. 11). 

There exists some connection between LDA canonical variates and other dimension reduction 
methods. In particular, it has been shown that for a categorical response variable sliced inverse 
regression (SIR; Li, 1991) is equivalent to LDA, except for a different scaling. In fact, SIR 


covariates are scaled to have unit covariance while the LDA canonical variates are scaled to 


have unit within-class covariance (Chen and Li, 2001). See also Kent (1991) for more discussion 


on the connection between SIR and LDA. 

Quadratic discriminant analysis (QDA) is obtained by removing the assumption of a common 
class covariance matrix. The resulting discriminant function is quadratic, and the decision 
boundaries are quadratic surfaces over the features subspace. However, in this case there appears 
to be no standard canonical variates analysis for QDA as there is for LDA. Some authors have 
considered dimension reduction methods for quadratic discrimination in normal populations 
with different covariance matrices. Pardoe et al (2007) proposed the use of sliced average 
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variance estimation (SAVE; Cook and Weisberg, 1991) as a graphical representation in quadratic 


discriminant analysis. Velilla (2008, 2010) discussed the concept of quadratic subspace as a tool 


for dimension reduction in QDA. 


Other extensions to LDA are regularized discriminant analysis (RDA, Friedman 1989), 


flexible discriminant analysis (FDA, Hastie et al, 1994), and penalized discriminant analysis 


(PDA, Hastie et al 1995). RDA represents a compromise between LDA and QDA; it uses a 


tuning parameter a for class covariance matrix estimation, i.e., the covariance matrix for class 
k is estimated by the convex combination 


— a ^k + (1 — 


w- 


FDA fits by optimal scoring a linear regression model using a basis expansion h(x) of the 
feature vector x. PDA also uses optimal scoring on a basis expansion h(x ) as in FDA, but with 
a quadratic penalty on the coefficients, i.e., solving the following optimization problem 


argmax /3 T 51^/3 
P 


subject to /3 T (Sm + A Cl) (3 = Id 


where fi is a (p x p) symmetric, nonnegative definite, penalty matrix. All these methods have 
no direct graphical representation associated, so usually canonical variates are computed as in 
LDA using the estimated class means. 


Finally, we mention the LAD proposal by Cook and Forzani (2009), a likelihood-based di 


mension reduction method under the assumption of conditional normality of predictors given 
the response. This model closely resembles the family of models we adopted, but the estima¬ 
tion procedure is quite different. In fact, no closed-form solution to the maximum likelihood 
estimation of the central subspace (the parameter of interest) is available. Thus, numerical 
optimization is used for maximization of the log-likelihood on Grassman manifolds. 


3 Finite mixture modelling in discriminant analysis 

Mixture discriminant analysis generalizes the previous approaches by allowing the density for 
each class conditional density to be expressed by a finite mixture of normals. Thus, a Gaussian 
mixture model for the £;-th class (k = 1 ,..., K ) has density 


G k 


fk {®) 'y '^ Ttgk&jtt i V'qki'^gk)') 


( 1 ) 


9=1 


where Tv g k are the mixing probabilities (tv g k > 0, Yl g =i n gk = 1), P-gk i s the mean of component 
g in class k, and is the covariance matrix of component g in class k. Thus, Gaussian 
components are ellipsoidal, centered at fj, k , and with other geometric features, such as volume, 
shape and orientation, determined by S 


gk- 


Hastie and Tibshirani (1996) introduced mixture discriminant analysis (MDA) assuming a 

with known fixed number of mixture 


common full covariance matrix, i.e. S gk = 5] for all g , k 
components for each class. 


In a procedure named eigenvalue decomposition discriminant analysis (EDDA), Bensmail 


and Celeux (1996) proposed the use of Gaussian finite mixture modeling for discriminant anal¬ 


ysis in which each class is modeled by a single Gaussian term, i.e., G k = 1 for all k, with the 
same (possibly parsimonious) class covariance structure factorized as 


Sfc = A k D k A k D 


T 
k ? 


where X k is a scalar value controlling the volume of the ellipsoid, A k is a diagonal matrix 
specifying the shape of the density contours, and D k is an orthogonal matrix which determines 


3 



































Table 1: Parametrizations of covariance matrices available in the mclust software (Fraley et al 
2012) and related geometric characteristics. 


Model 

^k 

Distribution 

Volume 

Shape 

Orientation 

E 

<j 

Univariate 

equal 



V 

& k 

Univariate 

variable 



Eli 

XI 

Spherical 

equal 

equal 


VII 

A k I 

Spherical 

variable 

equal 


EEI 

A A 

Diagonal 

equal 

equal 

coordinate axes 

VEI 

XkA 

Diagonal 

variable 

equal 

coordinate axes 

EVI 

AA fc 

Diagonal 

equal 

variable 

coordinate axes 

VVI 


Diagonal 

variable 

variable 

coordinate axes 

EEE 

A DAD t 

Ellipsoidal 

equal 

equal 

equal 

EEV 

A D k AD\ 

Ellipsoidal 

equal 

equal 

variable 

VEV 

A k D k ADj 

Ellipsoidal 

variable 

equal 

variable 

VVV 

^ k D k -A k D 

Ellipsoidal 

variable 

variable 

variable 


the orientation of the ellipsoid (Banfield and Raftery 1993; Celeux and Govaert 1995). Table [T| 


shows the MCLUST family of mixture models supported by the mclust package (Fraley et al 


2012) for the R software (R Core Team, 2013) 


A generalization of the previous two approaches is the method called MclustDA (Fraley and 


Raftery, 2002), where a density estimate for the data is obtained by a Gaussian finite mixture 


model with a different number of components and a different (possibly parsimonious) covariance 
matrix for each class. The corresponding family is thus very flexible allowing the distribution 
of each class to be approximated by a mixture of Gaussian components. 

Maximum likelihood estimates for finite mixture models can be computed via the EM algo¬ 


rithm (Dempster et al, 1977). Model selection, which requires choosing the number of mixture 


components and the covariance parameterization for each class, is usually based on penalized 


criteria, such as the Bayesian information criterion (BIC, Schwartz, 1978) or the integrated 


complete likelihood (ICL, Biernacki et al, 2000). 


4 Dimension reduction for model-based discriminant analysis 

4.1 Methodology 

Suppose we would like to find a suitable reduced number of projections which, depending on 
the estimated Gaussian mixture model, are able to visualize variation both in groups location 
and dispersion. Following the proposal of Scrucca (2010) for model-based clustering, consider 
the following matrices: 

K G k 

M \ ='52^2 UJ gk(Vgk ~ hXhgfc - tO T > 

fc=l g =1 

where u gk = n k 7T gk ( u gk > 0, Y,k, g u gk = 1), h = Ylk^kVk = Y.k^^gkHgk is the marginal 
mean vector, /z fc = Ylg= l ^gk^gk is the mean vector for class k, and 


K G k 


M\\ = ^ — H)S x 1 (5] 3fc - £) t , 

k= 1 g =l 

where E! = J2 k g^gk^gk is the pooled within-class covariance matrix, and Sx = A Y^i=ii x i — 
n)(xi — fi) T is the marginal covariance matrix. 
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The matrix M\ contains information on class-component means variation, while M\\ con¬ 
tains information on class-component covariances variation. The two types of information can 
be summarized using the following kernel matrix 

M = + M\\. (2) 

The matrix (3 e M. pxd ( d = min(p, Gk — 1) spanning the desired subspace is the solution 

of the following optimization 

argmax /3 T M(3 subject to /3 T H\-/3 = Id, (3) 

/3 

where Id is the (d x d ) identity matrix. The solution of ^ is obtained through the generalized 
eigen-decomposition of M with respect to Sj. Hence, the basis [3 of the dimension reduction 
subspace S((3) is obtained as S Y " times the eigenvectors of S Y ' MY, X 7 , with directions 
ordered according to the corresponding eigenvalues. The projections onto such subs pace is then 
given by z = /3 T x. In analogy with the name of the method proposed in Scrucca (2010), 
these are called GMMDRC (Gaussian Mixture Model Dimension Reduction for Classification) 
variables and provide a graphical method to display the classification structure resulting from 
a Gaussian mixture model. 

Note that in the presentation made so far we have assumed that the parameters of the 
population are known. Usually, however, they are unknown and must be estimated from training 
data as discussed in Section 14.31 


Scrucca (2010 


4.2 Properties 

For MDA models the subspace spanned by M is equivalent to that spanned by M\. This 
because under the MDA assumption of common class covariance, i.e., 'Eigk = XI for all g,k , we 
get M H = 0, so no contribution conies from the variation on class covariances. The same also 
happens for those EDDA models which assume constant class covariance matrices (i.e., models 
Eli, EEI, and EEE - see Table [TJ, because here = El for all k. In all the other cases, 
i.e., for those mixture models which allow different class covariance matrices, M I, adds further 
information for the identification of the reduction subspace. 

In two specific cases the subspace identified by GMMDRC reduces to simple known situations 
as described in the following propositions, whose proofs are contained in the Appendix. 


Proposition 1 Consider the EDDA mixture model with common full class covariance matrix 
(EEE). The subspace S((3), obtained by solving the GMMDRC constrained optimization in © 
with M = M/'S x 1 AIi , is identical to the subspace S((3 sir ) spanned by SIR, and also to the 
subspace S((3 LDA ) spanned by LDA canonical directions. 


Based on Proposition[l]we claim that using canonical LDA variables is only relevant when the 
adopted mixture model for classification assumes a single component with common covariance 
matrix for each class (see also 


Chen and Li, 2001 


Proposition 2 Consider the EDDA mixture model with different full class covariance matrices 
(VVV). The subspace S(/3), obtained by solving the GMMDRC constrained optimization in © 
with M as in ([2]), is identical to the subspace S(/3 save ) spanned by SAVE. 


Noting that an EDDA mixture model with a different full covariance matrix for each class 
is essentially equivalent to QDA, Proposition [2] supports the use of SAVE as a graphical coun¬ 
terpart to QDA. 
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Proposition 3 Let l\ > I 2 > • • • > Id > 0 be the eigenvalues from the generalized eigen- 
decomposition of the kernel M, i.e., Mf3j = IjUxflj, for j = 1 ,...,d. Each eigenvalue lj, 
corresponding to the direction (3j of the projection subspace S((3), can be written as 

lj = Var(E(/3j x\Y)) 2 + E(Va,r(f3]x\Y) 2 ) for j = 1,..., d. (4) 

Thus, the eigenvalues can be decomposed in the sum of the contributions given by 

• the squared variance of the between class-component means, 

• the average of the squared within class-component variances, 
along the corresponding directions. 


This result provides an interpretation for the contribution of each direction to the visualiza¬ 
tion of the classification structure. Along each GMMDRC direction classes can be separated by 
location, by dispersion, or both. In addition, those directions associated with zero or approxi¬ 
mately zero eigenvalues can be neglected since their contribution to class location or dispersion 
is negligible. A formal assessment of dimensionality could be pursued, for instance, by imple¬ 
menting a permutation test as described in Li (1991), however, is beyond the scope of this paper 
and deserves further study and investigation. 


4.3 Estimation 

For a (n x p) sample data matrix X and the corresponding (n x 1) vector Y containing the 
observed classes, the sample version M of Q is obtained using the corresponding estimates 
from the fit of a Gaussian finite mixture model among those discussed in Section [3j Then, 
sample GMMDRC directions are calculated from the generalized eigen-decomposition of M 
with respect to Sj, the sample marginal covariance matrix. 


5 Examples 

5.1 Waveform data 


This is an artificial three-class problem with p = 21 variables, often used in machine learning 


and considered to be a difficult pattern recognition problem (Breiman et al 1984 Hastie and 


Tibshirani, 1996). Consider the following three shifted triangular waveforms defined as 
m(j) = max(6 -\j - 11|,0), w 2 (j) = wi(j - 4), w 3 (j) = w^j + 4), 


for j = 1,..., 21. Then, the variables Xj are generated within each class Y as a random convex 
combination of two basic waveforms with noise added: 



UlWl(j) + (1 - ui)w 2 (j) + 
U 2 W 2 (j) + (1 - U 2 )w 3 (j) + 
U 3 W 3 {j ) + (1 - u 3 )wi(j) + 


G 


G 


G 


for Y = 1 
for Y = 2 , 
for Y = 3 


where j = 1,2,..., 21, Wh = (rc/i(l),..., rc/ l (21)) T for h = 1,2,3, (u\,u 2 ,u 3 ) be independent 
random variables uniformly distributed on [0,1], and €j following a standard normal distribution. 

Figure [l] shows a scatterplot of data points projected onto the directions estimated for the 
EDDA mixture model with EEE covariance structure, i.e., assuming a common class covari¬ 
ance. As already mentioned, this is equivalent to a plot of LDA canonical variates. Panel (a) 
contains the density contours for the three classes, which have the same shape, orientation, and 
volume. The graph on panel (b) displays the corresponding decision boundaries with associated 
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Figure 1: Plot of waveform data projected onto the first two directions for the EDDA model 
with common class covariance. Panel (a) shows the class density contours, while panel (b) the 
decision boundary with corresponding uncertainty. 


classification uncertainty, with the uncertainty shown using a greyscale where darker regions 
indicating higher uncertainty. As expected the decision boundaries are linear. 

Moving to a more complex model, we fitted an EDDA mixture model with VVV covariance 
structure, i.e., different class covariances. The corresponding projection is shown in Figure [2] 
In this case, the contours have different orientation (see panel (a)) because no restrictions 
were placed on the class covariance matrices, hence the estimated model provides a better 
approximation to the data distribution. The triangular form of the data appears more clearly 
than in the previous case. The plot on panel (b) contains the classification boundaries given by 
quadratic polygons, which shows a lower overall uncertainty than in the previous case. 




Dirl 


Dirl 


(a) (b) 

Figure 2: Plot of waveform data projected onto the first two directions for the EDDA model 
with different class covariances. Panel (a) shows the class density contours, while panel (b) the 
decision boundary with corresponding uncertainty. 
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Finally, Figure [3] shows the data projected onto the first two directions estimated from the 
selected MclustDA mixture model. This model fitted a mixture of three class-specific Gaussian 
mixtures where the class-specific mixtures had G± = 3, G 2 = 4, and G 3 = 3 spherical Gaussian 
distributions as components. These characteristics are clearly visible on panel (a) of Figure [3j 
The resulting decision boundaries are shown on Figure these appear to be highly nonlinear 
with a relative larger uncertainty where the classes overlap. Finally, note that, on the basis of 
the corresponding eigenvalues, the first two directions account for 96% of the overall information 
available in the 21 estimated directions. 



(a) 


(b) 


Figure 3: Plot of waveform data projected onto the first two directions for the selected MclustDA 
model. Panel (a) shows the class density contours, while panel (b) the decision boundary with 
corresponding uncertainty. 


5.2 Swiss banknotes 

Table 1.1 and 1.2) presented a dataset containing six physical mea¬ 
surements made on a sample of 1000 Swiss Franc bills. 100 observations were classified as 
genuines, and 100 as counterfeits. 

The EDDA mixture model selected by BIC is a EEV model, which assume class covariance 
structures with different orientations but same volume and shape (see Table [I]). Figure^ shows 
the data projected onto the first two GMMDRC directions with the corresponding density 
contours; there appears a clear separation between classes with a larger variability for the 
group of counterfeit banknotes. The corresponding classification boundary is quadratic with 
an outlying genuine note classified as counterfeit (see panel (b) of Figure [4]). The estimated 
subspace is quite similar to that obtained by SAVE, which we recall is equivalent to the one we 
would have obtained by fitting an EDDA mixture model with VVV covariance structure. 

Fitting a MclustDA mixture model we obtain the graphs in panels (c) and (d) of Figure Q. 
The model selected by BIC uses a three components mixture with common covariance structure 
for the group of counterfeits, and a single component mixture model for the group of genuine 
notes. The latter appears as an homogeneous group, whereas the counterfeits are more het¬ 
erogeneous with a clear separated cluster of observations (see panel (c)). Finally, panel (d) 
of Figure Q shows the classification boundaries which are clearly nonlinear in this case and 
classify correctly all the observed banknotes. 


Flury and Riedwyl (1988 


8 








Figure 4: Plot of genuine (g) and counterfeit (c) Swiss banknotes projected onto the first two 
directions estimated for the “best” EDDA mixture model (top panels) and the “best” MclustDA 
mixture model (bottom panels). Panels (a) and (c) show the class density contours, while panels 
(b) and (d) plot the decision boundaries with corresponding uncertainty. 


5.3 Simulated data with irrelevant and redundant features 


GMMDRC directions are able to identify those variables which contain information about the 
classification structure. The estimated basis of the subspace is thus formed by linear combina¬ 
tions of the original features. However, when irrelevant and/or redundant correlated variables 
are present, the corresponding estimated coefficients have negligible values. 

Consider the synthetic data example described in Maugis et al (2009, Sec. 6, scenario 5). A 


sample of size n = 200 is generated for a 10-dimensional ifeature vector. The first two variables 
are drawn from a mixture of four Gaussian distributions Xn 2 ] ~ N (t J ’ k ,1 2 ) with = (-2, -2), 
/x 2 = (-2,2), /x 3 = — /x 2 , /x 4 = ~H\, and mixing probabilities n = (0.3,0.2,0.3,0.2). The 
remaining eight variables are simulated according to the model *[ 3 , 10 ] = /3 T ®[ 1 , 2 ] + G where 

/3 = j q 3° 4 )’ e ~ with ^ = diag(J 2 ,0.5/ 2 ,J 4 ), and 0 P is the px p 

matrix of zeroes. For this scenario only the first two variables contain relevant information for 
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classification; the following four variables are correlated with the first two and therefore are 
redundant for classification purposes, whereas the remaining variables are independent both 
from the previous variables and from the classification. 

The plot of the sample data projected onto the subspace spanned by the first two GMMDRC 
directions is presented in Figure [5j which also contains the table of coefficients defining the 
basis of the estimated subspace. The first two GMMDRC directions contain all the information 
pertaining to the partition of the classes, with the last direction clearly negligible based on 
the value of the corresponding eigenvalue. Furthermore, the coefficients defining the first two 
directions are very close to zero for all the variables except the first two, those which are really 
needed for classification. 



-6 -4 -2 0 2 4 6 


GMMDRC basis 



Diri 

Dir 2 

Dir 3 

Xi 

-0.729 

-0.481 

-0.368 

X 2 

0.672 

-0.843 

0.701 

X3 

-0.008 

0.012 

-0.169 

X 4 

0.051 

0.055 

0.123 

X5 

0.052 

-0.138 

0.212 

X 6 

0.046 

0.083 

-0.270 

X7 

-0.023 

-0.025 

-0.183 

X 8 

0.033 

-0.087 

-0.363 

Xg 

0.011 

-0.035 

0.033 

XlO 

0.083 

0.143 

0.214 

Eigenvalues 

0.6527 

0.6069 

0.0005 

Cum. % 

51.79 

99.96 

100.00 


Dili 


Figure 5: Plot of simulated data, generated with irrelevant and redundant variables, projected 
onto the subspace spanned by the first two GMMDRC directions. The table at right shows the 
coefficients of the linear combinations that define the estimated directions with the correspond¬ 
ing eigenvalues. 


5.4 High-dimensional data example 


High-dimensional data represents a very challenging problem for many statistical methods, 
particularly when the number of available observations is small compared to the number of 
variables. Finite mixture models may be highly parameterized, thus fitting Gaussian mixtures 
to high-dimensional data requires some form of dimension reduction and/or some form of reg¬ 
ularization. For a recent review see Bouveyron and Brunet-Saumard (|2013). 

Microarray data are an extreme case of high-dimensional data, for the reason that the 
number of variables (genes) is usually much larger than the number of observations (samples). 
For instance, consider the famous gene expression cancer dataset from Golub et al (1999). The 


data contain information on gene expressions in samples from human acute myeloid (AML) and 
acute lymphoblastic (ALL) leukemias obtained from high-density Affymetrix oligonucleotide 
arrays. There are 3571 genes and 38 samples: 27 in class ALL, and 11 in class AML. The 
samples in class ALL could be further split into B-cell and T-cell types. A preliminary filtering 
of genes, based on t-tests with p-values adjusted for multiple comparisons using the [Benj amini 
and Hochberg (1995) method, selected a subset of 731 genes differentially expressed. Then, 


an MclustDA model was fitted on the selected subset assuming a common spherical covariance 
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matrix (Eli) for each component within any class. From this model the matrices M\ and M\\ 
can be estimated as discussed in Section |4.3[ However, to apply the eigen-decomposition ([3]) 


we need a regularized estimate of the marginal covariance matrix. Several approaches could be 
adopted, but here for simplicity we used = diag[s|]j = i r .. iP , where sf is the sample variance 
of the z-th gene. Such estimate ignores correlations between genes, which is not biologically 
realistic, but it has been shown to have no effect on classification accuracy (Dudoit et al, 2002). 

Figure [6^, shows a boxplot of AML and ALL samples projected along the first GMMDRC 
direction, which accounts for about 94% of total variation. A single direction is clearly able to 
separate the two types of cancer. However, the inclusion of the second direction, which accounts 
for another 4%, allows us to highlight an interesting feature previously not evident. Looking at 
Figure^ we see that the group of ALL samples can be further divided into B-cell and T-cell 
tumour types along the second GMMDRC direction, except for one unusual B-cell which is very 
close to the group of T-cell samples. 



(a) (b) 

Figure 6: Plots of Golub data projected along the first GMMDRC direction (a) and the first 
two GMMDRC directions (b). Points are marked as □ for AML samples, as • for ALL B-cell 
and ▲ for ALL T-cell samples. 


6 Finding the most discriminant directions 

The methodology introduced in Section [4] allows to visualize on a reduced subspace the under¬ 
lying characteristics of class densities. However, if groups differ not only on location but also on 
dispersion, this second type of information may be dominant, and the classes would not appear 
clearly separated along the main directions. 

If class separation is the goal, an appropriate modification of the kernel matrix ([2]) should 
be adopted, for instance, using the following convex linear combination 

M = AM|S” 1 M| + (l-A)Mn, (5) 

where 0 < A < 1 is a tuning parameter. By choosing a large A the estimated directions will 
focus more on differences on location. For A = 0.5 we give equal weight to the two types 
of information, while for A = 1 differences in class covariances are completely ignored. More 
generally, we could decide to optimize a measure of class separation, or minimize the uncertainty 
in classification. 
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Recently, Zhu and Hastie (2003) proposed a generalized log-likelihood-ratio (LR) statistic 
criterion to find the relevant directions for classification. They compare their proposal with 
SAVE on a simple bi-dimensional dataset with two groups. Figure Ek shows a data sample 
generated from this setting (for details see the mentioned paper). Zhu and Hastie (2003) argue 
that the relevant direction for discriminating the two groups corresponds to the first variable 
X {, where there are differences in means. This turns out to be the direction selected by the LR 
criterion they proposed. On the contrary, the first direction selected by SAVE corresponds to 
X< 2 , a direction which contains only differences in variances, so the two groups do not appear 
well separated. 



XI 


Dirl 


(a) 


(b) 



X 


Dirl 


(c) 


(d) 


Figure 7: Sample data from 


Zhu and Hastie 


(2003) simulation scheme. Panel (a) contains the 


scatterplot of the two classification variables. Panel (b) and (d) show the data projected along 
the first two estimated GMMDRC directions with, respectively, A = 0.5 (default) and A = 0.75. 
The latter value has been selected on the basis of the LR criterion, whose trace is shown in 
panel (c). 


We fit a Gaussian mixture model to such dataset after adding eight noise variables gener¬ 
ated from independent standard normals. The first two directions appears to be needed with 
associated eigenvalues (0.54923,0.28002), which accounts for a total contribution of about 83%. 
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Figure [7|o shows the data projected along such directions: the first direction correspond es¬ 
sentially to X 2 , while the second direction to X \. As for SAVE, the information coming from 
difference in variances is overwhelming that coming from difference in means, and this is what 
the plot shows. However, if our goal is to look for the most separating directions we can adopt 
the LR criterion for selecting the value of A in equation ([ 5 ]). Figure [?J:: shows the trace of LR 
over a regular grid for A: the optimal value is obtained for A = 0.75 (or greater), which yields 
the projection shown in Figure [7jl. Now, the first estimated direction is essentially equivalent 
to X[. the most discriminating variable, while the second direction is equivalent to X 2 . 

Instead of optimizing the LR criterion as discussed above, we could adopt a different per¬ 
spective based on dynamic graphics. For example, one could imagine to build a dynamic graph 
that, using a slider to manipulate the A parameter, allows us to change interactively the data 
projection. In this way, a user would be able to appreciate the transition between the focus 
directed to differences in location to differences in the dispersion. Furthermore, depending on 
the purpose of the analysis, by tuning the A parameter we could decide to highlight the structure 
of the classes and their characteristics, or to favor the separation of classes. 


6.1 Ionosphere data 


The ionosphere data were collected by 16 high-frequency antennas in Goose Bay, Labrador, 
Canada, and contain information about radar signals returned from the ionosphere. “Good” 
samples are those showing evidence of some type of structure in the ionosphere, while “bad” 
returns are those whose signals pass directly through the ionosphere and show no structure. 
A total of 351 signals were received, 225 were “good” returns and the remaining 126 were 
“bad” returns. The signals were processed using a function of 2 attributes for each of 17 pulse 
numbers that describe the complex electromagnetic signal. There are 34 continuous-valued 
feature variables, although one is a constant of all zeroes. The dataset is taken from the UCI 
Machine Learning Repository and it is available in the R package mlbench. 

Figure [ 8^1 shows the data projected onto the first two GMMDRC directions using the default 
A = 0.5 for a MclustDA mixture model having covariance structure VII with 4 components for 
the “bad” returns, and covariance VVV with 2 components for the “good” returns. On the 
basis of the graph it can be said that the two groups of signals differ mainly in the dispersion, 
with the “bad” returns showing a larger variance and “good” signals which are concentrated 
around the center of the graph. 


These findings are similar to those obtained with SAVE by Pardoe et al (2007). To improve 


groups separation we selected the tuning parameter A using the LR criterion discussed previ¬ 
ously. The graph in Figure [ 8)3 shows the projection onto the first two GMMDRC directions 
estimated using the optimal value A = 1. Here the separation between the two types of signal 
is clearly shown along the second direction, while the group of “good” signals appears to be 
composed of two separable sub-groups along the first direction. The latter is an interesting 
feature not previously recognized. 


7 Final comments 

The paper discussed a dimension reduction method for visualizing the classification structure 
and the geometric characteristics induced by a Gaussian mixture model. The methodology 
can also be easily adapted in order to recover the directions showing the maximum separation 
between the classes. 

Although in the article we have used two-dimensional projections, the proposed method 
can, in principle, be easily extended to subspaces of higher dimensions. However, graphical 
representations on spaces of dimension greater than 3 can be quite difficult. Preliminary results 
for implementing a guided tour in 2 -dimensions seem to be promising. 
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(b) 


Figure 8: Plot of ionosphere data projected onto two different estimated subspaces: (a) using 
the default A = 0.5; (b) using the optimal A = 1 for groups separation. Points marked as □ 
refer to “good” signals, those marked as • to “bad” signals. 


The methodology and corresponding plots discussed in this paper are available in the 
MclustDR function of the R package mclust (Fraley et al, 2012). 
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Proof of proposition [l] 

Assume an EDDA mixture model with common full class covariance matrix. The last condition 
implies that the matrix M\\ in equation ([2]) cancels out, so the kernel matrix simplifies to 

M = 

where M\ = 22fc=i 7T k(l l k ~ pH/Rc ~ r) T = ^ B, the between-class covariance matrix. The 
basis of the subspace <S(/3) provided by GMMDRC is obtained as the solution of the following 
problem 

Mf3j = Ij'ExPj, 

with h > ■ ■ ■ > Id, and d = min(p, K — 1). Thus, (3j is the jth eigenvector associated to the jth 
largest eigenvalue lj (j = 1,... ,d) of the (p x p ) matrix 

S x 1/2 MS“ 1/2 = S^MiS^AfiE^ 172 

= (s- 1 / 2 s B s x 1 / 2 ) T (s x 1 / 2 s e s-. 1/2 ). 

The subspace estimated by SIR is obtained as the solution of 

V B (3f R =lf R V x 0f R (6) 

which is given by the eigen-decomposition of S a -^ 2 SbS x ^ 2 . It is easily seen that (3j = /3~j iR 
and lj = (Z| IR ) 2 , for j = 1,..., d. Thus, the basis of the subspace provided by GMMDRC under 
model EDDA with full common class covariance matrix is equivalent to the basis estimated by 

SIR. 

We now consider the relation of GMMDRC with LDA canonical variates. From © , we 
may subtract 1*Hb/3^ R from both side and, recalling the decomposition of the total variance, 
S y = Sg + we may write 

^ R pf R - lf R S B 0f R = lf R V x pf R - lf R S B pf R 

(1 - lf R )X B pf R = /J IR (S X - S b )^ sir 

V B pf R = lf R /( 1 - lf R )-Z w Pf R - 

It is clear that /j IR /(l —/| IR ) and /j| IR are, respectively, the jth eigenvalue and the associated 

eigenvector of J^ B 'S W / , the decomposition solving the Rayleigh quotient used to derive 

canonical variates in LDA. Thus, the basis of the subspace 5(/3 LDA ) is equivalent to 5(/3 SIR ), 
which in turn is equivalent to that provided by GMMDRC under the specific model assumption. 
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Proof of proposition [2] 

The kernel matrix of SAVE can be written in the original scale of the variables as 


K 


ATsave = f I p - 


k =1 


1/2^-,, y, 1/2 


Recalling that Sx = S B + 5 we nray write the expression within parenthesis as follows: 


Then, 


5V (Sx - ^k)^ x ' = + ^w~ S fc )S 

_ V' - V' — I V* - V 2 


J x 


= s x 1/z e b e x 1/z + S X 1/Z (S W - £,,)£ 


- 1/2 


J x 


K 


M save = (s x 1/2 S b S x 1/2 + £ x 1/2 (£ w - S fe )S x 

fc=i 

= S" 1/2 S b ^ 1 S b E~ 1/2 + 


1/2 V 


if 


S x 1/2 - ^w)^x\^k - 5V) T J £ x 1/2 

= S“ 1 / 2 M|S^M|S - 1/2 + S-- 1 / 2 M||S - 1/2 
= S- 1 / 2 (M|E- 1 M| + M m )S-- 1/2 , 

where M\ and Afn are those obtained from an EDDA Gaussian mixture model with a single 
component for each class and different class covariance matrices (VVV). 


Proof of proposition [3] 


The proof is analogous to that provided for Prop. 2 in Scrucca (2010) and it is not replicated 
here. 
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